Domain Independent Authorship Attribution without Domain Adaptation
نویسندگان
چکیده
Automatic authorship attribution, by its nature, is much more advantageous if it is domain (i.e., topic and/or genre) independent. That is, many real world problems that require authorship attribution may not have in-domain training data readily available. However, most previous work based on machine learning techniques focused only on in-domain text for authorship attribution. In this paper, we present comprehensive evaluation of various stylometric techniques for cross-domain authorship attribution. From the experiments based on the Project Gutenberg book archive, we discover that extremely simple techniques based on stopwords are surprisingly robust against domain change, essentially ridding the need for domain adaptation when supplied with a large amount of data.
منابع مشابه
Domain Adaptation for Authorship Attribution: Improved Structural Correspondence Learning
We present the first domain adaptation model for authorship attribution to leverage unlabeled data. The model includes extensions to structural correspondence learning needed to make it appropriate for the task. For example, we propose a median-based classification instead of the standard binary classification used in previous work. Our results show that punctuation-based character n-grams form...
متن کاملDeep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning
Domain adaptation is a powerful technique given a wide amount of labeled data from similar attributes in different domains. In real-world applications, there is a huge number of data but almost more of them are unlabeled. It is effective in image classification where it is expensive and time-consuming to obtain adequate label data. We propose a novel method named DALRRL, which consists of deep ...
متن کاملImage alignment via kernelized feature learning
Machine learning is an application of artificial intelligence that is able to automatically learn and improve from experience without being explicitly programmed. The primary assumption for most of the machine learning algorithms is that the training set (source domain) and the test set (target domain) follow from the same probability distribution. However, in most of the real-world application...
متن کاملSample-oriented Domain Adaptation for Image Classification
Image processing is a method to perform some operations on an image, in order to get an enhanced image or to extract some useful information from it. The conventional image processing algorithms cannot perform well in scenarios where the training images (source domain) that are used to learn the model have a different distribution with test images (target domain). Also, many real world applicat...
متن کاملBlogs, Twitter Feeds, and Reddit Comments: Cross-domain Authorship Attribution
Stylometry is a form of authorship attribution that relies on the linguistic information to attribute documents of unknown authorship based on the writing styles of a suspect set of authors. This paper focuses on the cross-domain subproblem where the known and suspect documents differ in the setting in which they were created. Three distinct domains, Twitter feeds, blog entries, and Reddit comm...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2011